Local BBC TV

Local TV News England

The BARB viewing figures for the 6:30 local news bulletin on BBC England, from 2017 onwards.





BARB England local News Viewers
year average weekly viewers
2017 11,000,000
2018 10,000,000
2019 9,900,000
2020 11,000,000
2021 10,000,000
2022 8,600,000



BARB England local News Viewers
year week_commencing viewers
2017 2017-01-02 11,000,000
2017 2017-01-09 13,000,000
2017 2017-01-16 12,000,000
2017 2017-01-23 12,000,000
2017 2017-01-30 12,000,000
2017 2017-02-06 11,000,000
2017 2017-02-13 11,000,000
2017 2017-02-20 12,000,000
2017 2017-02-27 12,000,000
2017 2017-03-06 11,000,000
2017 2017-03-20 13,000,000
2017 2017-03-27 11,000,000
2017 2017-04-03 10,000,000
2017 2017-04-10 9,500,000
2017 2017-04-17 9,700,000
2017 2017-04-24 11,000,000
2017 2017-05-01 10,000,000
2017 2017-05-08 11,000,000
2017 2017-05-15 11,000,000
2017 2017-05-22 11,000,000
2017 2017-05-29 11,000,000
2017 2017-06-05 12,000,000
2017 2017-06-12 11,000,000
2017 2017-06-19 11,000,000
2017 2017-06-26 11,000,000
2017 2017-07-03 9,300,000
2017 2017-07-10 11,000,000
2017 2017-07-17 10,000,000
2017 2017-07-24 9,800,000
2017 2017-07-31 10,000,000
2017 2017-08-07 10,000,000
2017 2017-08-14 10,000,000
2017 2017-08-21 11,000,000
2017 2017-08-28 9,600,000
2017 2017-09-04 11,000,000
2017 2017-09-11 10,000,000
2017 2017-09-18 11,000,000
2017 2017-09-25 9,900,000
2017 2017-10-02 11,000,000
2017 2017-10-09 11,000,000
2017 2017-10-16 11,000,000
2017 2017-10-23 10,000,000
2017 2017-10-30 11,000,000
2017 2017-11-06 11,000,000
2017 2017-11-13 11,000,000
2017 2017-11-20 11,000,000
2017 2017-11-27 11,000,000
2017 2017-12-04 11,000,000
2017 2017-12-11 11,000,000
2017 2017-12-18 12,000,000
2018 2018-01-08 12,000,000
2018 2018-01-15 12,000,000
2018 2018-01-22 12,000,000
2018 2018-01-29 11,000,000
2018 2018-02-05 11,000,000
2018 2018-02-12 11,000,000
2018 2018-02-19 11,000,000
2018 2018-02-26 11,000,000
2018 2018-03-05 14,000,000
2018 2018-03-12 11,000,000
2018 2018-03-26 11,000,000
2018 2018-04-02 9,500,000
2018 2018-04-09 10,000,000
2018 2018-04-16 10,000,000
2018 2018-04-23 10,000,000
2018 2018-04-30 11,000,000
2018 2018-05-07 9,900,000
2018 2018-05-14 8,800,000
2018 2018-05-21 9,900,000
2018 2018-05-28 9,800,000
2018 2018-06-04 9,200,000
2018 2018-06-11 9,600,000
2018 2018-06-18 9,000,000
2018 2018-07-02 10,000,000
2018 2018-07-09 10,000,000
2018 2018-07-16 10,000,000
2018 2018-07-23 10,000,000
2018 2018-07-30 9,900,000
2018 2018-08-06 9,500,000
2018 2018-08-13 9,500,000
2018 2018-08-20 9,700,000
2018 2018-08-27 9,400,000
2018 2018-09-03 9,600,000
2018 2018-09-10 9,700,000
2018 2018-09-17 9,600,000
2018 2018-09-24 9,800,000
2018 2018-10-01 9,800,000
2018 2018-10-08 9,800,000
2018 2018-10-15 10,000,000
2018 2018-10-22 10,000,000
2018 2018-10-29 10,000,000
2018 2018-11-05 10,000,000
2018 2018-11-12 10,000,000
2018 2018-11-19 10,000,000
2018 2018-11-26 10,000,000
2018 2018-12-03 10,000,000
2018 2018-12-10 10,000,000
2018 2018-12-17 10,000,000
2018 2018-12-24 11,000,000
2019 2019-01-07 9,700,000
2019 2019-01-14 11,000,000
2019 2019-01-21 11,000,000
2019 2019-01-28 10,000,000
2019 2019-02-04 11,000,000
2019 2019-02-11 11,000,000
2019 2019-02-18 10,000,000
2019 2019-02-25 10,000,000
2019 2019-03-04 11,000,000
2019 2019-03-11 10,000,000
2019 2019-03-18 11,000,000
2019 2019-03-25 10,000,000
2019 2019-04-01 9,900,000
2019 2019-04-08 9,900,000
2019 2019-04-15 9,400,000
2019 2019-04-22 8,800,000
2019 2019-04-29 9,000,000
2019 2019-05-06 9,800,000
2019 2019-05-13 8,900,000
2019 2019-05-20 8,900,000
2019 2019-05-27 9,200,000
2019 2019-06-03 9,200,000
2019 2019-06-10 9,300,000
2019 2019-06-17 10,000,000
2019 2019-06-24 9,900,000
2019 2019-07-01 9,200,000
2019 2019-07-08 8,400,000
2019 2019-07-15 8,700,000
2019 2019-07-22 9,300,000
2019 2019-07-29 9,400,000
2019 2019-08-05 9,400,000
2019 2019-08-12 9,500,000
2019 2019-08-19 9,600,000
2019 2019-08-26 8,900,000
2019 2019-09-02 9,400,000
2019 2019-09-09 9,500,000
2019 2019-09-16 9,100,000
2019 2019-09-23 9,600,000
2019 2019-09-30 9,900,000
2019 2019-10-07 9,500,000
2019 2019-10-14 10,000,000
2019 2019-10-21 10,000,000
2019 2019-10-28 10,000,000
2019 2019-11-04 10,000,000
2019 2019-11-11 10,000,000
2019 2019-11-18 11,000,000
2019 2019-11-25 10,000,000
2019 2019-12-02 10,000,000
2019 2019-12-09 9,900,000
2019 2019-12-16 11,000,000
2019 2019-12-23 11,000,000
2020 2020-01-13 11,000,000
2020 2020-01-20 11,000,000
2020 2020-01-27 11,000,000
2020 2020-02-03 11,000,000
2020 2020-02-10 11,000,000
2020 2020-02-17 11,000,000
2020 2020-02-24 11,000,000
2020 2020-03-02 11,000,000
2020 2020-03-09 11,000,000
2020 2020-03-16 12,000,000
2020 2020-03-23 16,000,000
2020 2020-03-30 16,000,000
2020 2020-04-06 15,000,000
2020 2020-04-13 13,000,000
2020 2020-04-20 12,000,000
2020 2020-04-27 13,000,000
2020 2020-05-04 13,000,000
2020 2020-05-11 12,000,000
2020 2020-05-18 13,000,000
2020 2020-05-25 11,000,000
2020 2020-06-01 10,000,000
2020 2020-06-08 12,000,000
2020 2020-06-15 12,000,000
2020 2020-06-22 12,000,000
2020 2020-06-29 11,000,000
2020 2020-07-06 11,000,000
2020 2020-07-13 11,000,000
2020 2020-07-20 11,000,000
2020 2020-07-27 10,000,000
2020 2020-08-03 11,000,000
2020 2020-08-10 10,000,000
2020 2020-08-17 10,000,000
2020 2020-08-24 10,000,000
2020 2020-08-31 11,000,000
2020 2020-09-07 9,300,000
2020 2020-09-14 10,000,000
2020 2020-09-21 10,000,000
2020 2020-09-28 11,000,000
2020 2020-10-05 11,000,000
2020 2020-10-12 11,000,000
2020 2020-10-19 12,000,000
2020 2020-10-26 12,000,000
2020 2020-11-02 12,000,000
2020 2020-11-09 12,000,000
2020 2020-11-16 12,000,000
2020 2020-11-23 11,000,000
2020 2020-11-30 12,000,000
2020 2020-12-07 11,000,000
2020 2020-12-14 11,000,000
2020 2020-12-21 11,000,000
2021 2021-01-04 13,000,000
2021 2021-01-11 13,000,000
2021 2021-01-18 13,000,000
2021 2021-01-25 12,000,000
2021 2021-02-01 12,000,000
2021 2021-02-08 12,000,000
2021 2021-02-15 11,000,000
2021 2021-02-22 12,000,000
2021 2021-03-01 12,000,000
2021 2021-03-08 11,000,000
2021 2021-03-15 12,000,000
2021 2021-03-22 11,000,000
2021 2021-03-29 9,800,000
2021 2021-04-05 9,900,000
2021 2021-04-12 10,000,000
2021 2021-04-19 9,800,000
2021 2021-04-26 9,800,000
2021 2021-05-03 9,200,000
2021 2021-05-10 10,000,000
2021 2021-05-17 9,900,000
2021 2021-05-24 9,500,000
2021 2021-05-31 8,100,000
2021 2021-06-07 9,000,000
2021 2021-06-14 9,300,000
2021 2021-06-21 11,000,000
2021 2021-06-28 7,800,000
2021 2021-07-05 11,000,000
2021 2021-07-12 9,400,000
2021 2021-07-19 9,200,000
2021 2021-07-26 10,000,000
2021 2021-08-02 10,000,000
2021 2021-08-09 8,800,000
2021 2021-08-16 9,000,000
2021 2021-08-23 9,000,000
2021 2021-08-30 8,600,000
2021 2021-09-06 9,100,000
2021 2021-09-13 9,000,000
2021 2021-09-20 9,200,000
2021 2021-09-27 9,600,000
2021 2021-10-04 9,500,000
2021 2021-10-11 9,300,000
2021 2021-10-18 9,600,000
2021 2021-10-25 9,400,000
2021 2021-11-01 9,800,000
2021 2021-11-08 9,700,000
2021 2021-11-15 10,000,000
2021 2021-11-22 9,600,000
2021 2021-11-29 10,000,000
2021 2021-12-06 10,000,000
2021 2021-12-13 10,000,000
2021 2021-12-20 9,600,000
2022 2022-01-03 9,200,000
2022 2022-01-10 10,000,000
2022 2022-01-17 10,000,000
2022 2022-01-24 9,700,000
2022 2022-01-31 9,700,000
2022 2022-02-07 9,300,000
2022 2022-02-14 10,000,000
2022 2022-02-21 11,000,000
2022 2022-02-28 11,000,000
2022 2022-03-07 10,000,000
2022 2022-03-14 9,700,000
2022 2022-03-21 9,500,000
2022 2022-03-28 9,200,000
2022 2022-04-04 8,700,000
2022 2022-04-11 7,600,000
2022 2022-04-18 7,700,000
2022 2022-04-25 8,500,000
2022 2022-05-02 7,500,000
2022 2022-05-09 8,300,000
2022 2022-05-16 8,400,000
2022 2022-05-23 7,800,000
2022 2022-05-30 7,400,000
2022 2022-06-06 8,000,000
2022 2022-06-13 7,500,000
2022 2022-06-20 7,800,000
2022 2022-06-27 6,800,000
2022 2022-07-04 6,500,000
2022 2022-07-11 7,600,000
2022 2022-07-18 8,300,000
2022 2022-07-25 8,200,000
2022 2022-08-01 8,200,000
2022 2022-08-08 8,000,000
2022 2022-08-15 7,800,000
2022 2022-08-22 8,200,000
2022 2022-08-29 7,400,000
2022 2022-09-05 7,400,000





Linear Model

Overview

A linear model is the simplest forecasting method. A line of best fit is created from the historical data and projected onto future dates: at its most basic, the linear model is nothing more than a line of best fit extended forwards.

From the graph showing TV viewing, a few trends can be observed.

  1. A general decline in viewing over time.
  2. A seasonal trend with higher viewing in the winter months and lower viewing in the summer.
  3. A significant boost throughout the Covid period from early 2020 until early 2021.



Simple trend


Adding a simple linear trend to the graph produces the forecast shown below in black, which gives a downward trend. This trend has no seasonality and looks to be inflated by the higher-than-typical viewing during Covid. A blue trend line, fitted only to the pre-Covid historic data, shows a much steeper downward trend.
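As a rough sketch, the two trend lines could be fitted with lm(); this assumes the tv data frame shown later in the ARIMA section (week_commencing as a Date and viewers as the weekly figure), and the pre-Covid cut-off date used here is an assumption.

## Sketch: fit a trend to all of the data and a trend to the pre-Covid data only
trend_all <- lm(viewers ~ as.numeric(week_commencing), data = tv)
trend_pre_covid <- lm(viewers ~ as.numeric(week_commencing),
                      data = subset(tv, week_commencing < as.Date("2020-03-01")))

## overlay both trends on the weekly viewing plot
plot(tv$week_commencing, tv$viewers, type = "l",
     xlab = "week commencing", ylab = "viewers")
abline(trend_all, col = "black")       # trend fitted to all data
abline(trend_pre_covid, col = "blue")  # trend fitted to pre-Covid data only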



The linear model so far is of the form y ~ x, where y is the number of viewers and x is the week commencing. However, the trends identified visually do not depend on the 'week commencing' factor itself.

  1. The downwards trend is linked to the year of viewing.
  2. The seasonal viewing is linked to the quarter as there is lower viewing in the summer quarter and higher in the winter quarters.

Plotted on the graph are the linear models giving:

  • viewers ~ year in red,
  • viewers ~ quarter in green
  • viewers ~ week in purple

Using the year gives the gradual overall decline we would expect, whilst using the quarter or the week shows seasonal variation. However, the seasonal trend identified visually has lower viewing in the summer months and higher viewing in the winter, whereas both the quarter and week models put the highest point at new year followed by a gradual decline through the year, which is not correct.
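A minimal sketch of these three models, assuming the same tv data frame with numeric year, quarter and week columns (as shown in the ARIMA section), is:

## Sketch: the three simple linear models discussed above, with year, quarter
## and week treated as numeric predictors
m_year    <- lm(viewers ~ year,    data = tv)  # long-term decline only
m_quarter <- lm(viewers ~ quarter, data = tv)  # linear change across the quarters of each year
m_week    <- lm(viewers ~ week,    data = tv)  # linear change across the weeks of each year

## overlay the fitted values on the existing viewing plot
lines(tv$week_commencing, predict(m_year),    col = "red")
lines(tv$week_commencing, predict(m_quarter), col = "green")
lines(tv$week_commencing, predict(m_week),    col = "purple")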


One way to artificially create the seasonal trend is to add a cubic term in week to the year-based model. This takes the form viewers ~ year + poly(week,3) and gives the desired seasonal shape.

A final step would be to include a dummy variable to mark Covid as an inflated period: here Google mobility trends are used, where values during the Covid period are negative (pulling the fitted values down) and outside the Covid period they are zero. This would be included in the model in the form viewers ~ year + poly(week,3) + covid and is plotted on the graph in red.

This shows the seasonal trend as desired and the gradual yearly decline, and it recognises that the rise during Covid is not an indicator of future viewer numbers.

However, using the linear model in this way relied on a visual interpretation of the trends, particularly the seasonal one, rather than a statistical analysis of the data on which to base the forecast model.
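A rough sketch of how this final model and its projection might be produced is given below; the column names follow the forecast output shown next, but the construction of the future data frame (plain year and week-of-year values with covid set to zero) is a simplified assumption.

## Sketch: year + cubic week-of-year term + Covid indicator
m_final <- lm(viewers ~ year + poly(week, 3) + covid, data = tv)

## future weeks to forecast, with the covid factor set to zero
future <- expand.grid(week = 1:52, year = 2022:2030)
future$covid   <- 0
future$viewers <- predict(m_final, newdata = future)
head(future)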



##   year quarter year_quarter week_commencing week covid viewers
## 1 2022       3      2022_q3      2022-09-12   37     0 8360252
## 2 2022       3      2022_q3      2022-09-19   38     0 8402554
## 3 2022       3      2022_q3      2022-09-26   39     0 8449077
## 4 2022       4      2022_q4      2022-10-03   40     0 8499753
## 5 2022       4      2022_q4      2022-10-10   41     0 8554516
## 6 2022       4      2022_q4      2022-10-17   42     0 8613300
Forecast viewer numbers for England local news
year average weekly viewers percentage change from 2017
2017 11,000,000 0.0
2018 10,000,000 -5.5
2019 9,900,000 -9.1
2020 11,000,000 5.6
2021 10,000,000 -6.8
2022 8,600,000 -21.0
2022 8,800,000 -19.0
2023 8,300,000 -23.0
2024 7,900,000 -27.0
2025 7,500,000 -31.0
2026 7,100,000 -35.0
2027 6,700,000 -39.0
2028 6,200,000 -43.0
2029 5,800,000 -46.0
2030 5,400,000 -50.0




ARIMA Modelling


Generally, any time series can be decomposed into three components: a trend, a seasonal component, and a random component.

Auto-Regressive Integrated Moving Average (ARIMA) modelling is a more complex forecasting technique that uses auto-correlations and moving averages, relating the data at one point in time to the values at previous points.

The ARIMA model is mathematically of the form

\(\begin{aligned} y_t = & \delta + \epsilon_t + \\ & \phi_1 y_{t-1} + ... + \phi_{p} y_{t-p} + \\ & \theta_1 \epsilon_{t-1} + ... + \theta_{q} \epsilon_{t-q} \end{aligned}\)

The first terms, \(\delta\) and \(\epsilon_{t}\), are essentially a white noise component: \(\delta\) is a constant and \(\epsilon_{t}\) is an error term for the given time, showing the deviation from the constant \(\delta\).

The terms on the second line make up the auto-regressive part of the formula, where the previous values of \(y\) are each multiplied by a scale factor \(\phi\). The previous \(p\) terms are used; for example, if p = 2 then the value of y at the previous time, \(y_{t-1}\), and the one before that, \(y_{t-2}\), would be included.

The terms on the bottom line make up the moving average part of the formula, where the previous error terms (the deviations \(\epsilon_{t-1}\), \(\epsilon_{t-2}\), ... from the moving average) are each multiplied by a scale factor \(\theta\). As with the auto-regressive part, the previous \(q\) terms are used.
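For example, with p = 2, q = 1 and no differencing, the model would reduce to \(y_t = \delta + \epsilon_t + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \theta_1 \epsilon_{t-1}\).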


The ARIMA model is written in R in the form:

forecast::Arima(
  time_series,
  order    = c(p, d, q),
  seasonal = list(order = c(P, D, Q), period = n),
  include.drift = TRUE
)

where the terms given are (in brief)

  • data as a time series object
  • order terms for non-seasonal part of the data
  • seasonal terms for a given period (e.g. n = 52 for weekly data)
  • ‘p’ is the number of terms to lag for in the auto-regressive component
  • ‘d’ is the number of times a differencing is applied
  • ‘q’ is the lag for the moving average


Determining the components for the ARIMA Model


Accounting for Covid

With the linear model it was simple to add the Covid factor as a variable. The ARIMA process doesn't have this option, so to keep this example simple the initial data set is scaled based on the covid factor. This is shown below, with the red points the true data for the Covid period and the blue points the scaled values. The data pre- and post-Covid remains the same.
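A minimal sketch of such a scaling step is shown below; the exact adjustment applied to the Covid-period figures is not reproduced here, so this version (shrinking Covid-period viewing in proportion to a fractional covid value) is purely illustrative.

## Sketch: scale Covid-period viewing down using the mobility-based `covid`
## column (negative during Covid, zero otherwise). Assumes `covid` is a
## fractional change (e.g. -0.2); the exact scaling is an illustrative choice.
tv_scaled <- tv
tv_scaled$viewers <- ifelse(tv$covid < 0,
                            tv$viewers * (1 + tv$covid),
                            tv$viewers)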


Converting to a time series

The TV data is very simply converted to a time series.

library(tseries);library(zoo);library(forecast)
library(lubridate);library(dplyr) ## lubridate for decimal_date(), dplyr for the %>% pipe

## the data frame
tv %>% head()
##    region year quarter year_quarter week week_commencing  viewers covid
## 1 England 2017       1      2017_q1    1      2017-01-02 11308900     0
## 2 England 2017       1      2017_q1    2      2017-01-09 12659000     0
## 3 England 2017       1      2017_q1    3      2017-01-16 12093200     0
## 4 England 2017       1      2017_q1    4      2017-01-23 12086500     0
## 5 England 2017       1      2017_q1    5      2017-01-30 11540600     0
## 6 England 2017       1      2017_q1    6      2017-02-06 11328900     0
# Convert to time series
data_ts<-ts(tv$viewers, 
            freq= 365.25/7, #to get weeks in a year 
            start=decimal_date(tv$week_commencing %>% min())
)
data_ts %>% head()
## Time Series:
## Start = 2017.00273972603 
## End = 2017.09856450358 
## Frequency = 52.1785714285714 
## [1] 11308900 12659000 12093200 12086500 11540600 11328900
## plot the data to show it's still of the same shape
plot(data_ts)


As discussed in the linear model section, a long-term trend and a seasonal trend can both be identified visually. But whereas the linear model approximated these relationships and added a simple cubic term to make the seasonal trend appear correct, the ARIMA model uses statistical techniques to determine what trends are truly in the data.

Decomposing the series is a good way to gain an initial understanding.

plot(decompose(data_ts))

The trend component shows the general decline, with a levelling off during Covid. The seasonal component is clearly identified and shows the pattern we know to be true: a rise in the winter months and a drop in the summer. The final component contains what has been identified as random fluctuation.

Creating a stationary series

A stationary series is one where the observed value does not depend on time, and neither do statistical properties such as the mean or variance.

Visually, a stationary series shows fluctuation from observation to observation, but no seasonality and no trend over time. To check this statistically the Augmented Dickey-Fuller (ADF) test is used: its null hypothesis is that the series is non-stationary, so if the p-value is less than 0.05 the null hypothesis is rejected and the series can be treated as stationary.

## test for stationarity
adf.test(data_ts)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  data_ts
## Dickey-Fuller = -3.5967, Lag order = 6, p-value = 0.03384
## alternative hypothesis: stationary

The p-value is below 0.05 suggesting the data is stationary, but visual observations very much suggest it isn’t.

The most common way to make a series stationary is to take the difference between one observation and the next, then test again for stationarity. If that series is not stationary, the difference may be taken again. The log of the values is also often taken to reduce the effect of extreme variation in the size of values.

The function ndiffs() suggests how many times the differencing should be applied.

## What is the suggested differencing?
ndiffs(data_ts)
## [1] 1
## What does this look like?
plot(data_ts %>% log %>%  diff())
abline(h=0, col="red")

## Test for stationarity
adf.test(data_ts %>%log %>%  diff())
## 
##  Augmented Dickey-Fuller Test
## 
## data:  data_ts %>% log %>% diff()
## Dickey-Fuller = -7.7555, Lag order = 6, p-value = 0.01
## alternative hypothesis: stationary

The p-value is now lower than before and below the 0.05 threshold, so the null hypothesis can be rejected and the data series deemed stationary. As the data was differenced once to achieve stationarity, the ARIMA order value d takes the value one.



Seasonality and Cyclicity

Visual observation and the decomposition plot clearly show seasonality and cyclicity, so the auto-regressive and moving average parts will need to be identified.

To determine the values for the order terms p and q, an auto-correlation plot (ACF) and a partial auto-correlation plot (PACF) are produced. The ACF is created by taking the correlation of each term with the term one step prior, then the term two steps prior, then three steps, and so on. Below is an example of how this is done manually and then with one simple line of code.

## create the differenced data to mimic the differenced time series data
x<-tv %>% select(viewers) %>% mutate(diff = log(viewers) - log(lag(viewers)))%>% tail(-1)
head(x)
##    viewers          diff
## 2 12659000  0.1127783983
## 3 12093200 -0.0457251135
## 4 12086500 -0.0005541839
## 5 11540600 -0.0462178745
## 6 11328900 -0.0185142698
## 7 11477600  0.0130403269
## create three data sets with the actual data, one lag, and a second lag
x_0 <- x$diff  %>% head(-1)
x_1 <- x$diff %>% tail(-1)
x_2 <- x$diff %>% tail(-2)
## to illustrate the lags
cbind(x_0,x_1,x_2) %>% head()
##                x_0           x_1           x_2
## [1,]  0.1127783983 -0.0457251135 -0.0005541839
## [2,] -0.0457251135 -0.0005541839 -0.0462178745
## [3,] -0.0005541839 -0.0462178745 -0.0185142698
## [4,] -0.0462178745 -0.0185142698  0.0130403269
## [5,] -0.0185142698  0.0130403269  0.0340801445
## [6,]  0.0130403269  0.0340801445 -0.0303841185
## look at the correlation at lag one
plot(x_0, x_1)
abline(reg=lm(x_0 ~ x_1)) 
text(0.2,0.3, paste0('cor = ',round(cor(x_0,x_1),2)))

## look at the correlation at lag two
plot(x_0 %>% head(-1), x_2)
abline(reg=lm(x_0%>% head(-1) ~ x_2)) 
text(0.2,0.3, paste0('cor = ',round(cor(x_0 %>% head(-1),x_2),2)))

The correlation coefficients for the first two lags, calculated manually, are given below. The ACF plot does the equivalent: it finds the correlations for many lags and plots them.

## the correlation coefficients for lags 1 and 2
print(signif(cor(x_0,x_1),2) )
## [1] -0.41
print(signif(cor(x_0 %>% head(-1),x_2),2))
## [1] 0.0072
## the ACF will find each coefficient and then plot the results. 
acf(x$diff, plot = FALSE) %>% head(n=6)
## 
## Autocorrelations of series 'x$diff', by lag
## 
##      1      2      3      4      5      6 
## -0.407  0.007 -0.058  0.018 -0.038  0.111
acf(x$diff, plot = TRUE, lag = 120 )

# Using the time series gives exactly the same result
#acf(data_ts %>%log %>%  diff())

The correlation with lag one data is significant as it falls outside the threshold shown by the blue lines. This means the order value would be q = 1.

Within this plot there are two peaks falling just outside the threshold, around lag 50 and lag 100, suggesting an annual correlation. A seasonal component should therefore be included in the model, and Q = 1 is a good starting point.

The second technique is the partial auto-correlation, where the correlation at each lag is considered with the effect of the intermediate lags removed. The PACF output is given below and, as with the ACF, the lag 1 term is above the significance threshold, so the order value is determined to be p = 1.

pacf(x$diff, plot = FALSE) %>% head(n=6)
## 
## Partial autocorrelations of series 'x$diff', by lag
## 
##      1      2      3      4      5      6 
## -0.407 -0.190 -0.164 -0.099 -0.109  0.050
pacf(x$diff, lag = 120 )

#pacf(data_ts %>%log %>%  diff()) equivalent for the time series data

Here there seems to be another spike at around lag 100, or two years, suggesting there may be a seasonal P component, but as there is no similar peak around week 52 this is less clear.


Modelling using ARIMA

From the test for stationarity, the ACF, and the PACF the order values for the ARIMA were determined:

  • p = 1
  • d = 1
  • q = 1

An ARIMA model can then be used to forecast the data and provides a set of coefficients. The seasonal components P, D, Q are harder to identify; the ACF and PACF suggested that seasonal components should be included, so initially the same values as the non-seasonal order terms are used, with a period of 52 weeks.

fit <- forecast::Arima(data_ts %>% log(), 
                       order  = c(1,1,1),
                       seasonal = list(order = c(1,1,1), period = 52),
                       #include.drift = TRUE,
                        method = 'CSS'
                       )
fit
## Series: data_ts %>% log() 
## ARIMA(1,1,1)(1,1,1)[52] 
## 
## Coefficients:
##          ar1     ma1     sar1     sma1
##       0.2551  -0.829  -0.3403  -0.4160
## s.e.  0.0862   0.049   0.0732   0.0946
## 
## sigma^2 = 0.00463:  log likelihood = 268.85

The general equation is of the form

\(\begin{aligned} y_t = & \delta + \epsilon_t + \\ & \phi_1 y_{t-1} + ... + \phi_{p} y_{t-p} + \\ & \theta_1 \epsilon_{t-1} + ... + \theta_{q} \epsilon_{t-q} \end{aligned}\)

With p = 1, q = 1 and the fitted coefficients, this can be re-written for the differenced series \(y'_t = y_t - y_{t-1}\) as

\(\begin{aligned} y'_t = & \delta + \epsilon_t + \\ & 0.2551 y'_{t-1} \\ & - 0.829 \epsilon_{t-1} \end{aligned}\)

The differenced series \(y'_t\) appears because d = 1, and the coefficients are those given in the model output above.

Once the moving average and autocorrelation parts are calculated, all that should remain is the white noise component.

The seasonal MA and AR components look the same mathematically, but have different coefficient values and use the lag at the seasonal period rather than the previous value. For example, the seasonal equivalent of \(\phi_1 y_{t-1}\) with a period of 52 weeks would be \(-0.3403 y_{t-52}\).
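For reference, the fitted ARIMA(1,1,1)(1,1,1)[52] model can also be written compactly using the backshift operator \(B\) (where \(B y_t = y_{t-1}\)). This is a standard equivalent form rather than output produced by the model, with \(y_t\) here the logged viewer series:

\((1 - 0.2551 B)(1 + 0.3403 B^{52})(1 - B)(1 - B^{52}) y_t = (1 - 0.829 B)(1 - 0.4160 B^{52}) \epsilon_t\)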


Evaluating the model

The residuals from the fit should appear as white noise.

checkresiduals(fit)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,1,1)(1,1,1)[52]
## Q* = 61.113, df = 53, p-value = 0.2075
## 
## Model df: 4.   Total lags used: 57

Plotting the residuals shows a stationary series with no trend, with the residuals approximately normally distributed around a mean of zero. However, the autocorrelation of the residuals shows a significant peak around lag 50, which suggests the seasonal terms P, D, Q may need altering.

pred <- predict(fit, n.ahead = 52*9)

ts.plot(data_ts , 2.718^pred$pred, log = "y", lty = c(1,3), xlab="time",ylab = "viewers (mil)")

The prediction into the future looks fairly good, but as the residual ACF has a significant spike it is worth comparing against slightly different seasonal orders.

Using seasonal components P, D, Q of 1,1,0 and 0,1,1 does not remove the significant residual in the ACF. Using 1,0,1 does remove it; however, as shown below, that change produces no usable forecast.

fit2 <- forecast::Arima(data_ts %>% log(),
                       order  = c(1,1,1),
                       seasonal = list(order = c(1,0,1), period = 52),
                       #include.drift = TRUE,
                        method = 'CSS'
                       )
fit2
## Series: data_ts %>% log() 
## ARIMA(1,1,1)(1,0,1)[52] 
## 
## Coefficients:
##          ar1      ma1    sar1    sma1
##       0.1257  -0.6632  0.0536  0.0553
## s.e.  0.0976   0.0738  0.1342  0.1437
## 
## sigma^2 = 0.00374:  log likelihood = 366.09
checkresiduals(fit2)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,1,1)(1,0,1)[52]
## Q* = 53.122, df = 53, p-value = 0.4695
## 
## Model df: 4.   Total lags used: 57
pred2 <- predict(fit2, n.ahead = 52*9)
ts.plot(data_ts , 2.718^pred2$pred, log = "y", lty = c(1,3), xlab="time",ylab = "viewers (mil)")

Auto Arima

A useful technique is to use the auto.arima function as a starting point for deciding upon the components p,d,q and P,D,Q.

For this case, the auto.arima function gives ARIMA(0,1,1)(0,0,1)[52], which is quite different from the orders decided upon manually from the plots. The auto.arima residuals are very good, showing white-noise behaviour, but the forecast is not usable.

auto_fit <- forecast::auto.arima(data_ts %>% log())
auto_fit
## Series: data_ts %>% log() 
## ARIMA(0,1,1)(0,0,1)[52] 
## 
## Coefficients:
##           ma1    sma1
##       -0.5880  0.0994
## s.e.   0.0526  0.0666
## 
## sigma^2 = 0.004408:  log likelihood = 370.42
## AIC=-734.84   AICc=-734.75   BIC=-723.87
checkresiduals(auto_fit)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,1)(0,0,1)[52]
## Q* = 45.524, df = 55, p-value = 0.8151
## 
## Model df: 2.   Total lags used: 57
auto_pred <- predict(auto_fit, n.ahead = 52*9)
ts.plot(data_ts , 2.718^auto_pred$pred, log = "y", lty = c(1,3), xlab="time",ylab = "viewers (mil)")

Model Comparison


In summary, the ARIMA approach appears more rigorous than the linear model as it tests for stationarity and separates out each component to be dealt with individually. The components p,d,q and P,D,Q can be chosen from the ACF and PACF plots, but ultimately a model needs to be built to determine whether the forecast is useful.

The linear model appears less thorough than the ARIMA model and does not split out the seasonal and moving-average components. However, the linear model can include additional factors, such as the Covid effect, directly as model terms, whereas the ARIMA model requires the data to be pre-processed to account for them.


Both the ARIMA and linear model forecasts show the same seasonal pattern, but the ARIMA shows a steeper long-term decline. By 2030 the linear model forecasts a 50% reduction in viewers from 2017, whilst the ARIMA predicts a 61% reduction. Visually, the shape of the ARIMA forecast appears to better reflect the seasonal trend in the data, and its line of best fit appears to better reflect the previous data.
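As a rough sketch, the yearly figures in the comparison table below could be assembled from the two forecasts along these lines; future and pred refer to the objects from the earlier linear-model sketch and the predict(fit, ...) call, and the aggregation details are assumptions.

## Sketch: annual averages and percentage change relative to the 2017 average
library(dplyr)

actual_2017 <- mean(tv$viewers[tv$year == 2017])

## ARIMA forecast: back-transform from the log scale, then average by year
arima_yearly <- data.frame(
  year    = floor(as.numeric(time(pred$pred))),
  viewers = exp(as.numeric(pred$pred))
) %>%
  group_by(year) %>%
  summarise(arima = mean(viewers)) %>%
  mutate(arima_change = arima / actual_2017 - 1)

## linear model forecast: average the predicted weekly viewers by year
linear_yearly <- future %>%
  group_by(year) %>%
  summarise(linear = mean(viewers)) %>%
  mutate(linear_change = linear / actual_2017 - 1)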

notes: project line of best fit backwards put on true data without covid being accounted for

try for other data sets

Linear model vs ARIMA forecast comparison
Year Linear model Linear % change ARIMA ARIMA % change
2017 11,000,000 0% 11,000,000 0%
2018 10,000,000 -5.5% 10,000,000 -5.5%
2019 9,900,000 -9.1% 9,900,000 -9.1%
2020 9,900,000 -9.3% 9,900,000 -9.3%
2021 9,200,000 -16% 9,200,000 -16%
2022 8,600,000 -21% 8,400,000 -23%
2023 8,300,000 -23% 7,600,000 -30%
2024 7,900,000 -27% 7,000,000 -35%
2025 7,500,000 -31% 6,400,000 -41%
2026 7,100,000 -35% 5,900,000 -46%
2027 6,700,000 -39% 5,400,000 -50%
2028 6,200,000 -43% 5,000,000 -54%
2029 5,800,000 -46% 4,600,000 -58%
2030 5,400,000 -50% 4,200,000 -61%






